Abstract. Let 𝒟 = {d_1, d_2, d_3, ..., d_D} be a given set of D documents of total length N. The top-K document
retrieval problem is to index 𝒟 such that when a pattern P and a parameter K come as a query, the index
returns the K documents most relevant to the pattern P (in sorted order). Hon et al. [15] gave the first
linear space framework to solve this problem in O(|P| + K log K) time. This was improved by Navarro and
Nekrich [22] to O(|P| + K). In these papers, pattern matching is done first and then the documents are
retrieved based on the locus of the pattern. During the retrieval phase, the factor O(|P|) is used to bound the
query time. Once one considers word-packing in RAM or the external memory model, this factor is no longer
optimal. Besides, many applications require retrieval to be independent of pattern searching. We show a
linear space index which takes strictly O(K) time, once the locus of the pattern match is given. Separately, we also
give an external memory linear space index taking near-optimal O(|P|/B + log_B N + log log B + K/B) I/Os
(outputs not sorted). Our results are surprising in the sense that they defy the usual range searching bounds.
Our techniques also have implications in the cache-oblivious model.



1 Introduction 



The inverted index is the most fundamental data structure in the field of information retrieval [3^. It 
is the backbone of every known search engine today. For each word in any document collection, the 
inverted index maintains a list of all documents in that collection which contain the word. Despite its 
power to answer various types of queries, the inverted index becomes inefficient, for example, when 
queries are phrases instead of words. This inefficiency comes from inadequate use of word orderings in 
query phrases [Mj- Similar problems also occur in applications when word boundaries do not exist or 
cannot be identified deterministically in the documents, like genome sequences in bioinformatics and 
text in many East-Asian languages. These applications call for data structures to answer queries in a 
more general form, that is, (string) pattern matching. Specifically, they demand the ability to identify 
efficiently all the documents that contain a specific pattern as a substring. The usual inverted-index 
approach might require the maintenance of document lists for all possible substrings of the documents. 
This can take quadratic space and hence is neither theoretically interesting nor sensible from a practical 
viewpoint. 

The first frameworks for answering pattern matching (and related) queries were proposed by Matias
et al. [20] and Muthukrishnan [21]. Their data structures solve the document listing problem, in which
a collection 𝒟 of D documents is required to be indexed so that given a query pattern P, all the
documents that contain P can be retrieved efficiently. As the pattern can appear in a single document
multiple times, a major challenge of this problem is that the overall number of pattern occurrences
can be much greater than the number ndoc of result documents. Therefore, it is unaffordable to
answer a query by enumerating all the occurrences of P.

Muthukrishnan also initiated the study of relevance metric-based document retrieval [21], which
was then formalized by Hon et al. [15] as the top-K document retrieval problem. Here, instead of all
the documents that match a query pattern, the problem is to output the K documents most relevant
to the query in sorted order of relevance score. Relevance metrics considered in the problem can be
either pattern-independent (e.g., PageRank) or pattern-dependent. In the latter case one can take into
account information like the frequency of the pattern occurrences (the term frequency of the popular tf-idf
measure) and even the locations of the occurrences (e.g., min-dist [15], which takes the proximity of the two
closest occurrences of the pattern as the score). The framework of Hon et al. [15] takes linear space and
answers the query in O(|P| + K log K) time. This was then improved by Navarro and Nekrich [22] to
achieve O(|P| + K) query cost, which is in a way optimal. Several other approaches for top-K document
retrieval have recently been published. Some use, instead of linear space, succinct space [SP3] or
semi-succinct space [29,11,23,14,5]. Their query costs, however, usually contain a multiplicative
polylogarithmic factor on the output size K (or ndoc). This problem is seeing a burst of research activity
in both mainstream venues for algorithms and information retrieval [15,17,22,24] as well as plenary
talks [23,16] in the string matching community.

All the above approaches use a two-phase procedure to answer a query. The first phase identifies
the locus of P in a suffix tree, that is, the node corresponding to the pattern P. The second phase finds
the top-K results in the subtree rooted at the locus. [15] and [22] reduce the Phase-2 subproblem to a
3d orthogonal range searching problem with four constraints. While general four-constraint orthogonal
range searching is proved hard [9], the desired bounds can nevertheless be achieved by exploiting
a special property: one dimension of the reduced subproblem can only have |P| distinct values.
Employing this property, an additive O(|P|) term inevitably appears in the cost of Phase-2,
which is actually sub-optimal. In this paper, we shall prove that Phase-2 can be answered strictly in
O(K) time in the RAM model. Our techniques also have implications in the external memory (EM)
model. The following are some motivating applications:

1. None of the existing approaches works in external memory. Here, we need the |P|/B factor in I/Os
to be optimal. Thus, it is expensive to have an additive O(|P|) factor. Even in RAM, the optimal
pattern matching time (based on word-packing) is taken as O(|P|/log_{|Σ|} N + log log N). In this
sense, the O(|P| + K) bound is not optimal even in RAM.



2. In applications like cross-document pattern matching [18], the pattern P is given by a location in some
document and needs to be found in other documents. Since the collection can be pre-indexed,
the locus of the pattern can be found in O(log log N) time using a weighted level ancestor query (or even
in faster O(log log |P|) time). Thus, an additive O(|P|) factor can be unaffordably expensive.

3. Autocompletion has become an indispensable component of modern search engines. For example,
in Google Instant, instead of waiting for a user to complete a query before the search starts,
relevant results are rendered in real time as the user types. From the server's point of view, if the user
types a string P, this procedure can issue up to |P| queries, one for each prefix of P. Therefore,
answering a Phase-2 query in O(|P| + K) time leads to an O(|P|^2) term in the overall query cost.

4. In many pattern matching applications, for example in suffix-prefix overlap [28] or maximal
substring matches [19], multiple loci are searched with amortized constant time for each locus. In such
situations, an extra O(|P|) from the retrieval part leads to non-optimal solutions.

1.1 Related Work, Problem Complexity and Our Approach 

For the document listing problem, Muthukrishnan gave a somewhat optimal solution, which uses linear
space and O(|P| + ndoc) query cost [21]. As the overall number of occurrences of P can be much
larger than ndoc, he uses the idea of chaining, by which a one-sided constraint on a particular dimension
guarantees that no document is enumerated more than once. With proper labeling, the pattern query can
be converted into a 2d orthogonal three-sided query. Hon et al. extend the idea to tree-shaped chaining,
which can be used to solve the top-K document retrieval problem [15]. Here, the extra top-K constraint
can be converted to a threshold in the third dimension, thus making the query a four-constraint query in
3d space.
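The chaining idea can be illustrated with a minimal sketch. It assumes the suffix range of P is already known as an index range [lo, hi] over an array that gives the document of each suffix in suffix-tree leaf order; the names `build_chain` and `list_documents` are hypothetical, and a plain linear scan stands in for the range-minimum machinery that gives the O(ndoc) bound in [21].

```python
def build_chain(doc_of_suffix):
    """For each position i, store the previous position holding the same document (-1 if none)."""
    last_seen = {}
    prev = []
    for i, d in enumerate(doc_of_suffix):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = i
    return prev


def list_documents(doc_of_suffix, prev, lo, hi):
    """Report every document occurring in doc_of_suffix[lo..hi] exactly once.

    A position i is reported only if its chain pointer leaves the range on the
    left (prev[i] < lo), i.e. it is the first occurrence of its document in the
    range. This is the one-sided constraint that prevents duplicates.
    """
    return [doc_of_suffix[i] for i in range(lo, hi + 1) if prev[i] < lo]
```

For example, with `doc_of_suffix = [0, 1, 0, 2, 1, 1, 0, 2]` and the range [2, 6], only positions 2, 3 and 4 have chain pointers to the left of the range, so each of the three documents is reported once.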

Both [15] and [22] use the fact that on one of the dimensions, the set of geometric points
related to query P has only |P| distinct values. There are some other properties that potentially
also ease the problem, compared to the general orthogonal range searching problem. One of them is
that two of the constraints form a tree range, that is, the range of the pre-order ranks of the subtree
rooted at a node in the tree. This limits the number of possible ranges to O(N) instead of O(N^2).
Chien et al. [10] show, however, that any general range can be broken into a logarithmic number of
tree ranges, which implies that this special property can improve only an additive
polylogarithmic term. Another special property is that the third constraint always has the same boundary as the
first constraint. By placing the first two constraints on the x-dimension and the third constraint on
the y-dimension, the query projects to a hinged three-sided window, that is, one of its corners lies on the
diagonal y = x. The top-K constraint translates to the fourth constraint on the third dimension. This
is what is exploited in our external memory solution, which is near optimal. We further combine the
property of hinged windows with tree ranges to show that these four-constraint queries can be broken
into O(log N) three-sided subqueries, which leads to novel results in the cache-oblivious model.

1.2 Our Results 

In RAM, Phase-1 (i.e., finding the locus of pattern P) can be processed in either O(|P|) time with
a suffix tree, or in O(|P|/log_{|Σ|} N + log log N) time with a word-packed suffix tree. In the external
memory or cache-oblivious model, Phase-1 can be answered in O(|P|/B + log_B N) I/Os with a string
B-tree [7]. We summarize our results as follows.

1. In RAM, there exists an O(N)-word data structure that solves Phase-2 of the top-K document
retrieval problem in optimal O(K) time. This result improves the previous work [15,22] by
eliminating the additive term |P|.

2. In the external memory model, there exists an O(N)-word structure that solves the top-K document
retrieval problem in O(log_B N + log log B + K/B) I/Os. This is surprising because the bound is closer
to that of the three-constraint orthogonal range searching problem than to that of the four-constraint one (as
the problem appears to be). Further, the optimal O(log_B N + K/B) query I/Os can be achieved
using an almost-linear O(N log log log B)-word space structure.
3. In the cache-oblivious model, there exists an O(N)-word structure that solves the document listing
problem in O(log N + ndoc/B) I/Os. Notice that this problem is at least as hard as the interval
stabbing problem, and at most as hard as the three-sided orthogonal range searching problem. While the
O(log N) term does not look optimal, no better result is known for the interval stabbing problem in
this model. On the other hand, it has been proved that any 0{\o^^^^ N + ndoc/ B)-l/0 structure 
must take super-linear Q{N log'' N) space [2]. This shows that the document listing problem is, by 
hardness, closer to the interval stabbing problem than to the three-sided orthogonal range searching
problem. For the top-K document retrieval problem, one can also derive new bounds close to the
three-sided query bounds (using super-linear space) in the cache-oblivious model.

2 String Retrieval Framework 

This section briefly explains the framework of Hon et al. [15]. Define score(P, d), the score of a document
d with respect to a pattern P, to be the relevance of d to P, which is a function of the locations of all of P's
occurrences in d. The generalized suffix tree (GST) of a document collection 𝒟 = {d_1, d_2, d_3, ..., d_D} is
the combined compact trie (a.k.a. Patricia tree) of all the nonempty suffixes of all the documents. Use N
to denote the total length of all the documents, which is also the number of leaves in GST. For
each node u in GST, consider the path from the root node to u. Let depth(u) be the number of nodes
on the path, and prefix(u) be the string obtained by concatenating all the edge labels of the path. For
a pattern P that appears in at least one document, the locus of P, denoted as u_P, is the node closest to
the root such that P is a prefix of prefix(u_P). By numbering all the nodes in GST in pre-order,
the part of GST relevant to P (i.e., the subtree rooted at u_P) can be represented as
a range, called the suffix range of P.

Nodes are marked with documents. A leaf node ℓ is marked with a document d ∈ 𝒟 if the suffix
represented by ℓ belongs to d. An internal node u is marked with d if it is the lowest common ancestor
of two leaves marked with d. Notice that a node can be marked with multiple documents. For each
node u and each document d it is marked with, define a link to be a quadruple (origin, target, doc, score),
where origin = u, target is the lowest proper ancestor¹ of u marked with d, doc = d and score =
score(prefix(u), d). Two crucial properties have been identified in [15].

Lemma 1 The total number of links is upper bounded by O(N).

Lemma 2 For each document d that contains a pattern P, there is a unique link whose origin is in the
subtree of u_P and whose target is a proper ancestor of u_P. The score of the link is exactly the score of
d with respect to P.

The top-K document retrieval problem can thus be reduced to the problem of finding the top-K
links that originate in the subtree of u_P and whose target is a proper ancestor of u_P. With the nodes in GST
numbered in pre-order, these constraints translate into finding all the links (i) whose
origin numbers fall in the number range of the subtree of u_P, and (ii) whose
target numbers are less than the number of u_P. Regarding constraint (i) as a two-sided range constraint on the
x-dimension, and constraint (ii) as a one-sided range constraint on the y-dimension, the problem asks for
the top-K points that fall in a three-sided window in 2d space. Furthermore, as the left endpoint of the
range in (i) always equals the endpoint of the range in (ii), one corner of the three-sided window must
be on the diagonal y = x. We thus name the resulting problem top-K hinged range reporting.
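As a reference point for the reduction, the following brute-force sketch states the top-K hinged range reporting query directly. It is only a specification-level illustration, not one of the data structures of this paper: the `Link` record, the function name `top_k_hinged`, and the convention that a higher score means more relevant are assumptions made for the example.

```python
import heapq
from typing import NamedTuple


class Link(NamedTuple):
    origin: int    # pre-order number of the origin node
    target: int    # pre-order number of the target node
    doc: int       # document identifier
    score: float   # relevance score (assumed: higher is better)


def top_k_hinged(links, lo, hi, k):
    """Return the k highest-scored links with lo <= origin <= hi and target < lo.

    lo is the pre-order number of the locus u_P and hi is the largest number in
    its subtree, so the window's corner (lo, lo) lies on the diagonal y = x.
    """
    candidates = [ln for ln in links if lo <= ln.origin <= hi and ln.target < lo]
    return heapq.nlargest(k, candidates, key=lambda ln: ln.score)
```

The scan costs time linear in the number of links; the structures in the following sections answer the same query in O(K) time once the locus is known.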

3 Linear Space, O(log N + K) Retrieval Time Data Structure

Once the locus u_P of a pattern P has been identified, Hon et al.'s structure retrieves the top-K
documents in O(|P| + K log K) time [15], while Navarro and Nekrich's takes O(|P| + K) time [22]. We
shall propose structures whose query cost is independent of |P|. This section achieves a complexity of
O(log N + K) by employing a GST node-numbering scheme based on the centroid path decomposition
idea of Sleator and Tarjan [27]. This complexity is O(K), and thus optimal, when K ≥ log N. The
solution for the case K < log N is left to the next section.

¹ Define a dummy node as the parent of the root node, marked with all the documents.

3.1 Centroid Path Decomposition-Based Traversal 

Define the weight of a node u in a rooted tree T to be the number of nodes in the subtree rooted
at u. For each internal node u, define its successor as its child with the maximum weight (breaking ties
arbitrarily). The centroid path decomposition of tree T is the spanning subgraph of T whose edge set consists of
all the predecessor-successor edges [27]. The name comes from the fact that the resulting graph consists
of only disjoint paths, called centroid paths. A crucial property of this decomposition is that every path
in T can intersect only O(log N) centroid paths.

Consider the following tree-traversal algorithm defined in a recursive manner, starting at the root. 

The traversal of the subtree rooted at a node u is done by first visiting node u, then recursively
traversing the subtrees of the children of u. The children are ordered in such a way that the
successor of u is traversed last.

Define the centroid rank, denoted as c-rank(u), of a node u to be the integer i if u is the i-th visited node
in the aforementioned traversal algorithm. Let c-path(u) be the centroid path that a node u belongs to, and
c-depth(u) be its centroid depth, namely, the number of centroid paths intersected by the root-to-u
path in T.

We can identify the following properties. 

Property 1. For each node u ∈ T, the centroid ranks of all the nodes in its subtree form a range
[c-rank(u), c-rank(u) + |subtree(u)|).

Property 2. Let u be a proper descendant of a node v on the same centroid path. Each descendant
of u has a larger centroid rank than all descendants of v, excluding those in the subtree of u. Formally,
u' ∈ subtree(u) and v' ∈ subtree(v) \ subtree(u) implies c-rank(u') > c-rank(v').
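The traversal and the two properties can be checked on small trees with the following sketch. The tree encoding (a dictionary mapping each node to its list of children) and the function name `centroid_ranks` are assumptions made for illustration; ties between equally heavy children are broken by taking the first one.

```python
def centroid_ranks(children, root):
    """Return (c_rank, size): centroid ranks (1-based) and subtree sizes.

    children maps a node to the list of its children; leaves may be absent.
    """
    # Collect nodes parent-before-child, then accumulate subtree sizes bottom-up.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    size = {}
    for u in reversed(order):
        size[u] = 1 + sum(size[c] for c in children.get(u, []))

    # Pre-order traversal in which the successor (heaviest child) is visited last.
    c_rank, counter, stack = {}, 0, [root]
    while stack:
        u = stack.pop()
        counter += 1
        c_rank[u] = counter
        kids = children.get(u, [])
        if kids:
            heavy = max(kids, key=lambda c: size[c])
            # Push the successor first so it is popped only after every
            # other child's subtree has been fully traversed.
            stack.append(heavy)
            stack.extend(c for c in kids if c != heavy)
    return c_rank, size
```

On the tree `{0: [1, 2], 1: [3, 4], 2: [5], 4: [6, 7]}` the centroid path through the root is 0, 1, 4, 6. The subtree of node 1 receives the ranks 4 to 8 (Property 1), and the subtree of node 4 receives ranks 6 to 8, all larger than the ranks 4 and 5 given to the rest of the subtree of node 1 (Property 2).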

3.2 The Search Structures 

The reduction of [15] can be adapted to work with the numbering scheme from the centroid path
decomposition-based traversal algorithm, instead of pre-order traversal. Perform the aforementioned
traversal algorithm on GST. Then, a link (origin, target, doc, score) qualifies for a query pattern P if and
only if (i) c-rank(origin) ∈ [c-rank(u_P), c-rank(u_P) + |subtree(u_P)|), (ii) c-rank(target) < c-rank(u_P),
and (iii) its score is among the top K of all the links that satisfy (i) and (ii). We say a link is a candidate
of P if it satisfies the first two requirements. Next, we categorize all the candidates of P into two types.

- Interpath links: those links satisfying c-depth(target) < c-depth(u_P).

- Copath links: those links with c-depth(target) = c-depth(u_P) and c-rank(target) < c-rank(u_P).

In the rest of this section, we shall find the top-K results for the two types individually. The
combination of the two result sets can be achieved by a single merge in O(K) time.

3.2.1 Processing Interpath Links We shall decompose the query into O(log N) subqueries. Each
can be served by answering an online sorted range reporting query. In the online sorted range reporting
problem, an array A is indexed so that given a query (i, j), the entries in the subarray A[i..j] can
be reported in sorted order one by one until the user terminates the reporting. Brodal et al. [8]
proposed a linear-space structure that achieves O(1) cost per entry. The top-K results among interpath
links can thus be obtained by an O(log N)-way merge. Since the number of elements in the heap for
merging is O(log N), an atomic heap [13] can do each heap operation in O(1) time, leading to an overall
O(log N + K) retrieval time.



Fig. 1. Example of centroid paths, interpath and copath links. 



It remains to describe how to obtain and serve the subqueries. Observe that c-depth(·) can have only
O(log N) distinct values, meaning that we can afford to enumerate all possible values of c-depth(target)
that satisfy c-depth(target) < c-depth(u_P), one for each subquery. For each depth δ = 1, 2, ..., O(log N),
define A_δ to be the array of all the links whose c-depth(target) = δ, ordered by their c-rank(origin).
We also keep a multiset B_δ which consists of all the c-rank(origin) of the links in A_δ, to support the
conversion from the query range [c-rank(u_P), c-rank(u_P) + |subtree(u_P)|) to the subrange [i_δ, j_δ] on
A_δ. Namely, A_δ[i_δ..j_δ] is exactly the set of the links in A_δ whose c-rank(origin) falls in the query range.

Then, the subquery at level δ can be obtained by computing i_δ (resp., j_δ) as the rank of c-rank(u_P)
(resp., c-rank(u_P) + |subtree(u_P)| − 1) in B_δ. It can finally be served by an online sorted range reporting
query (i_δ, j_δ) on the array A_δ.

Analysis. We can store each weighted array A_δ with the 1d top-K range reporting structure of [8],
and each multiset B_δ with the succinct dictionary of [26]. The decomposition into the O(log N) subqueries
can be achieved in overall O(log N) time since the computation of the ranks in B_δ costs O(1) time
each. The retrieval time, as mentioned before, is O(log N + K). Therefore, the query cost of this part
is O(log N + K). To analyze the space consumption, all the structures of A_δ take Σ_δ O(|A_δ|) = O(N)
words. As each B_δ contains at most 2N − 1 numbers in the domain {1, ..., 2N − 1}, its dictionary
consumes O(N) bits, meaning that all O(log N) dictionaries consume O(N log N) bits, or O(N) words. Therefore,
our structure occupies O(N)-word space.
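The multiway merge over the per-depth subqueries can be sketched as follows. This is an illustration under simplifying assumptions: `sorted_range_stream` is a hypothetical stand-in that sorts the subarray eagerly, whereas the structure of [8] reports each entry in O(1) time, and Python's binary heap (via `heapq.merge`) replaces the atomic heap of [13]. Entries are plain scores, with higher meaning better.

```python
import heapq
from itertools import islice


def sorted_range_stream(array, i, j):
    """Yield the entries of array[i..j] in descending order, one at a time.

    Naive stand-in for online sorted range reporting: it sorts the whole
    subarray up front instead of paying O(1) per reported entry.
    """
    yield from sorted(array[i:j + 1], reverse=True)


def top_k_interpath(arrays, ranges, k):
    """Merge one descending stream per centroid depth and stop after k entries.

    arrays[d] plays the role of A_delta and ranges[d] = (i, j) the role of the
    subrange [i_delta, j_delta]; an empty subrange is encoded as i > j.
    """
    streams = [sorted_range_stream(a, i, j)
               for a, (i, j) in zip(arrays, ranges) if i <= j]
    return list(islice(heapq.merge(*streams, reverse=True), k))
```

Because the merge is lazy, only the first K merged entries are ever pulled from the streams, which mirrors the "report until the user terminates" interface described above.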

3.2.2 Processing Copath Links A candidate copath link (origin, target, doc, score) must satisfy
c-rank(origin) ≥ c-rank(u_P). Otherwise, as target is a proper ancestor of u_P on the same centroid
path, by Property 2 origin cannot be in the subtree of u_P. On the other hand, if c-rank(target) <
c-rank(u_P) ≤ c-rank(origin), the link is a copath candidate. Therefore, the retrieval of the copath
links can be converted into a top-K variant of the traditional 1d interval stabbing query [6].
Specifically, for each centroid path π, let A_π be the set of the links whose targets are on π. For
each link (origin, target, doc, score) ∈ A_π, construct a weighted interval with (i) the left endpoint
c-rank(target) + 1, (ii) the right endpoint c-rank(origin) and (iii) the weight score. Indexing the
intervals properly, the query over the copath links can be served by first identifying the centroid path
π = c-path(u_P), then retrieving the top-K intervals from A_π that are stabbed by c-rank(u_P).

It remains to index the weighted intervals of a set A_π. Consider a sweeping line that continuously
moves from −∞ to +∞, on which a singly linked list is maintained to keep track of all the intervals
that currently intersect the sweeping line. The intervals in the linked list are sorted in descending
order of their weights. As the sweeping line encounters the left (resp., right) endpoint of an interval,
the interval is inserted into (resp., deleted from) the linked list. For any stabbing query c-rank(u_P), there must
be a moment at which the first K elements of the linked list are exactly the answer. To support query
answering on all the snapshots, the linked list can be implemented with the persistent linked list [12].
This structure guarantees that at any snapshot, once the list head has been identified, the linked list
can be traversed in O(1) time per element. Therefore, the top-K intervals can be retrieved by first
finding the list head of the correct snapshot, which is a predecessor search, and then traversing the linked
list at the snapshot.

Analysis. The persistent linked list of the intervals for each A_π takes O(|A_π|) words, implying that
the overall space consumption is Σ_π O(|A_π|) = O(N) since no link appears in two different sets A_π. It
takes O(log log |A_π|) = O(log log N) time to identify the list head in the persistent structure, and O(K)
time to report the K intervals. Therefore, the overall query cost is O(log log N + K).
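The sweep-line construction can be sketched with explicit snapshots. This is a deliberately naive illustration: it stores a full copy of the weight-sorted list at every event position (quadratic space in the worst case) where the text uses a persistent linked list [12], and it uses binary search from the standard `bisect` module for the predecessor step. The function names and the closed-interval encoding `(left, right, weight)` with integer endpoints are assumptions.

```python
import bisect


def build_snapshots(intervals):
    """Sweep over closed integer intervals (left, right, weight).

    Returns (keys, lists): lists[t] holds, in descending weight order, the
    intervals stabbed by every point q with keys[t] <= q < keys[t + 1].
    """
    events = []
    for iv in intervals:
        left, right, _ = iv
        events.append((left, 0, iv))        # interval becomes alive at left
        events.append((right + 1, 1, iv))   # and stops being alive after right
    events.sort(key=lambda e: (e[0], e[1]))

    keys, lists, alive = [], [], []
    for pos, kind, iv in events:
        if kind == 0:
            alive.append(iv)
        else:
            alive.remove(iv)
        snapshot = sorted(alive, key=lambda t: -t[2])
        if keys and keys[-1] == pos:
            lists[-1] = snapshot            # several events share this position
        else:
            keys.append(pos)
            lists.append(snapshot)
    return keys, lists


def top_k_stabbed(keys, lists, q, k):
    """Predecessor search for the snapshot covering q, then read its first k entries."""
    t = bisect.bisect_right(keys, q) - 1
    return lists[t][:k] if t >= 0 else []
```

With the intervals (2, 5, 10), (1, 3, 7), (4, 8, 9) and (6, 7, 3), a stabbing query at 4 finds the snapshot created at position 4 and returns the intervals of weight 10 and 9 in that order.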

Lemma 3 There exists a linear space data structure taking O(log N + K) time for top-K document
retrieval queries once the locus of the pattern is known. For K ≥ log N, this takes O(K) time. □

4 Linear Space, Optimal O(K) Retrieval Time Data Structure

The data structure described in the previous section is optimal for K ≥ log N. In this section, we
provide another linear space structure, which is optimal for any K < log N, thus capturing all cases.

Marked nodes and prime nodes in a tree: Given a tree T (with no single-child node) of n leaves, we
identify certain nodes in T as marked nodes with respect to a parameter g called the grouping factor.
The procedure starts by combining every g consecutive leaves (from left to right) together as a group,
and marking the lowest common ancestor (LCA) of the first and last leaf in each group. Further, we mark the
LCA of all pairs of marked nodes recursively. Additionally, we ensure that the root is always marked.
At the end of this procedure, the number of marked nodes in T will be O(n/g) [15]. Every child of a
marked node is called a prime node. For any marked node u*, there is a unique prime ancestor node
u'. In case u*'s parent is marked, u' = u*. For every prime node u', the corresponding marked
descendant u* (if it exists) is unique. If u' is marked, then the descendant u* is the same as u'.
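The marking procedure can be sketched by brute force on a small tree. The encoding (a `parent` dictionary in which the root maps to `None`, plus the list of leaves in left-to-right order) and the function name `mark_nodes` are assumptions for illustration; LCAs are computed by walking up the tree and the pairwise-LCA closure is iterated until nothing changes, which is far from the efficient construction but follows the definition literally.

```python
def mark_nodes(parent, leaves, g, root):
    """Return (marked, prime) for grouping factor g.

    parent maps each node to its parent (the root maps to None);
    leaves lists the leaves from left to right.
    """
    def ancestors(u):
        # u, parent(u), ..., root
        path = []
        while u is not None:
            path.append(u)
            u = parent[u]
        return path

    def lca(u, v):
        seen = set(ancestors(u))
        for w in ancestors(v):
            if w in seen:
                return w

    marked = {root}                              # the root is always marked
    for start in range(0, len(leaves), g):
        group = leaves[start:start + g]
        marked.add(lca(group[0], group[-1]))     # LCA of first and last leaf

    changed = True
    while changed:                               # close the set under pairwise LCA
        changed = False
        for u in list(marked):
            for v in list(marked):
                w = lca(u, v)
                if w not in marked:
                    marked.add(w)
                    changed = True

    prime = {u for u, p in parent.items() if p in marked}   # children of marked nodes
    return marked, prime
```

On a complete binary tree with 8 leaves and g = 4, only the root and its two children are marked, and the prime nodes are the children and grandchildren of the root.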

4.1 The structure 

Using the above scheme, we perform the marking of nodes in GST, with grouping factor g = log N. Let
u' be a prime node and let u* (if it exists) be the unique marked descendant of u'. Then, all the links
originating from the subtree of u' are categorized into the following.

1. fringe-links: The links originating from the subtree of u', but not from the subtree of u*.

2. near-links: The links originating from the subtree of u* whose target is within the subtree of u'.

3. far-links: The links originating from the subtree of u* whose target is a proper ancestor of u'.

4. small-links: The links with both origin and target within the subtree of u*.

Lemma 4 The number of fringe-links and the number of near-links of any prime node u' is O(g).

Proof. There are at most 2g leaves in subtree(u') \ subtree(u*). Only one link for each document comes
out of subtree(u*). Therefore, the number of fringe-links can be bounded by 4g. For every document
d whose link originates from subtree(u*) and goes out of it, the link ends up as a near-link if and only if d occurs
at one of the leaves of subtree(u') \ subtree(u*). Thus, the number of near-links can be bounded by 4g too. In the case that
u* does not exist for u', only fringe-links exist, and since the subtree size of u' is O(g) there can be no
more than O(g) of these links. □

Consider the following set, consisting of O(g) links with respect to u': all fringe-links, all near-links
and the g highest scored far-links. For any node u whose closest prime ancestor (including itself) is u', the
above mentioned set is called the candidate links of u. We maintain these candidate links at u'. From each
u, we maintain a pointer to its closest prime ancestor, where the list of candidate links is stored.



Lemma 5 The candidate links of any node u contain the top-g highest scored links among those with
origin in the subtree of u and target above u.

Proof. Let u' be the closest prime ancestor of u. If no marked descendant of u' exists, then all the links
are stored as candidate links. Otherwise, small-links can never be candidates as they never cross u.
Now, if u lies on the path from u' to u*, then all far-links satisfy both the origin and target conditions.
Otherwise, no far-link qualifies. Hence, any link which is not among the top-g (highest scored) of these far-links
can never be among the top-g links of u. □

We maintain all these candidate links in sorted order of score (as a list called the candidate list),
and maintain a pointer to it from every node u whose top-g links belong to this collection. Note
that the candidate list of many nodes can be the same, and it consists of O(g) links. To filter out the
top-K links corresponding to any node u, we maintain additional structures.

Let B_u be a bit vector of length O(g) associated with node u, such that B_u[i] = 1 if and only if
the i-th highest scored link in the candidate list of u (maintained at u') is a valid link for u. A link is
said to be valid with respect to a node u if its origin is in the subtree of u and its target is above u. Since
this bit vector is of length O(g) = O(log N), we can easily implement a rank/select structure on it (using
tables) which can give answers in O(1) time.

4.2 Query Answering 

In order to answer the top-K query (for K < g) corresponding to a locus node u_P, we just retrieve
the select(B_{u_P}, i)-th highest scored link in the candidate list of u_P (stored at its prime ancestor u'_P)
for i = 1, 2, 3, ..., K, where select(B_{u_P}, i) returns the position of the i-th 1 in the bit vector B_{u_P}. These select
queries give the positions of the valid links for u_P. Since the links are sorted
in score order, we get the top-K answers in sorted order.

Space-Time Analysis: The total space for maintaining O(g) words of candidate links for each of the O(N/g)
marked nodes is O(N). By choosing g = log N, the total space of the bit vectors associated with all nodes
can be bounded by O(N log N) bits, and the number of pointers is also O(N). The query time is O(K)
as each select query takes only constant time.
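The filtering step can be sketched with a Python integer playing the role of the O(log N)-bit vector. This is an illustration under assumptions: the function name `top_k_from_candidates` is hypothetical, bit i (counting from the least significant bit, 0-based) corresponds to the (i+1)-th entry of the candidate list, and a lowest-set-bit loop replaces the table-based constant-time select.

```python
def top_k_from_candidates(candidate_list, valid_bits, k):
    """Return the first k valid links of a score-sorted candidate list.

    candidate_list is sorted by descending score; bit i of valid_bits is 1
    iff candidate_list[i] is valid for the query node.
    """
    result = []
    bits = valid_bits
    while bits and len(result) < k:
        low = bits & -bits                       # isolate the lowest set bit
        result.append(candidate_list[low.bit_length() - 1])
        bits ^= low                              # clear it and continue
    return result
```

For a candidate list `["a", "b", "c", "d", "e"]` and `valid_bits = 0b10110`, the valid entries are at positions 1, 2 and 4, so a top-2 query returns `["b", "c"]`, already in score order.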

Lemma 6 There exists a data structure taking O(N) space which can answer top-K document retrieval
queries in optimal O(K) time, for any K < log N. □

Combining Lemma 6 with Lemma 3, we get our main theorem.

Theorem 1 There exists a data structure taking O(N) space which can answer top-K document
retrieval queries in optimal O(K) time, once the locus node of the pattern is known. □



5 External Memory Data Structures 

It is known that no linear-space external memory structure can answer even the simpler 1d top-K
range reporting query in 0(log*^*-^^ N + K/B) I/Os if the output order must be ensured. We thus turn 
our attention to solving the unordered variant of the top-K document retrieval problem in the external 
memory and cache-oblivious models. Namely, the K results can be returned in an arbitrary order. 

As the reduction from top-K document retrieval to top-K hinged range reporting still works in
external memory (refer to Section 2), this section further converts the top-K requirement into a
one-sided score constraint, called a threshold, and then solves the resulting problem with a divide-and-conquer
idea. The problem can be formally stated as follows.

Problem 1. Index a set S of 3d points² so that a query (a, b, τ) returns all points (x, y, z) ∈ S satisfying
y < a ≤ x ≤ b and z < τ.

² The third dimension comes from the scores.



5.1 Converting Top-K to Threshold via the Logarithmic Sketch 

By setting the grouping factor g again to log N and computing marked and prime nodes as in Section
4, a similar query answering scheme can be adopted. Specifically, given the locus u_P of a query P whose
lowest prime ancestor is u', we process fringe-, near- and far-links with respect to u' individually. The
fringe- and near-links can be handled by simply scanning and checking all of them against the query
constraints, and finally returning the K links with the highest scores (in case fewer than K links qualify,
return all of them). This takes only O(g/B) = O((log N)/B) = O(log_B N) I/Os.

The processing of far-links, however, is more sophisticated. As the number of far-links with respect
to u' can be much larger than log N, we cannot afford to store all of them. Instead, we keep a logarithmic
sketch of the far-links of u'. Namely, the sketch consists of the scores of the far-links that rank
first, second, fourth, eighth, and so on. The top-K results among the far-links can be retrieved by the
following steps. First, find the score τ that ranks 2^⌈log K⌉-th, which has been stored in the logarithmic
sketch on u'. Then, with the found τ and the tree range of u_P, issue a query of Problem 1 over all links.
This step will return, instead of the top-K, the top-O(K) links.

The combination of the two result sets can be done by a K-selection algorithm over the K + O(K) =
O(K) links. Since a logarithmic sketch takes O(log N)-word space, and the number of prime nodes is
O(N/g) = O(N/log N), the overall space consumption, excluding the structure of Problem 1, is linear.
The query cost, excluding the subquery of Problem 1, is O(log_B N + K/B) I/Os. Therefore, to
achieve the claimed bound, it remains to solve Problem 1 with a linear-space structure that answers
a query in O(log_B N + log log B + K/B) I/Os.
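The logarithmic sketch can be illustrated in a few lines. The sketch below assumes that a higher score means more relevant, and the names `build_sketch` and `threshold_for` are hypothetical; it only shows how the stored ranks 1, 2, 4, 8, ... turn a top-K request into a score threshold that admits fewer than 2K links.

```python
def build_sketch(scores):
    """Keep the scores that rank 1st, 2nd, 4th, 8th, ... in descending order."""
    ranked = sorted(scores, reverse=True)
    sketch = []
    r = 1
    while r <= len(ranked):
        sketch.append(ranked[r - 1])
        r *= 2
    return sketch


def threshold_for(sketch, k):
    """Return the score ranked 2^ceil(log2 k)-th for k >= 1.

    Returns None when fewer than that many scores exist, in which case
    every far-link is a candidate.
    """
    idx = (k - 1).bit_length()        # equals ceil(log2 k) for k >= 1
    return sketch[idx] if idx < len(sketch) else None
```

For the scores 60, 50, 40, 30, 20, 10, 5 the sketch is [60, 50, 30]. A top-3 request maps to rank 4 and hence threshold 30, which admits four links; the final K-selection step then trims them to three.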

5.2 Near I/O-Optimal, Linear Space Data Structure 

5.2.1 Small-Grid Structure Given an additional restriction to Problem 1 that every point (x, y, z) ∈
S satisfies x, y ∈ {1, 2, ..., U}, this subsection proposes a data structure that takes O(|S| + B)-word
space, and answers a query in O(log_B |S| + K/B) I/Os. Such a structure will require the following
toolkit.

Lemma 1 (Persistent Sorted List). Given an update sequence (of insertions and deletions) on a
totally ordered set, there exists a linear-space data structure that can retrieve the $Z$ minimum elements
at any version in $O(1 + Z/B)$ I/Os. Here, a version is the content of the set after an update.

Proof (sketch). Consider a block-based linked list implementation. Each node in the linked list is a
block, which stores (i) $B/2$ to $B$ elements, and (ii) a pointer to its next node in the linked list. Then,
insertions and deletions can be handled easily: if an insertion into some node makes it contain more than
$B$ elements, split the node into two nodes; if a deletion makes it contain fewer than $B/2$ elements, merge it
with its predecessor or successor (both work), then split the new node if necessary. This structure can
be easily made persistent with standard persistence techniques [12, 4].
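The block invariant in the proof can be illustrated with a small ephemeral sketch. It is not persistent and handles insertions only; the class name `BlockList` and the use of a Python list of blocks in place of a linked list are assumptions made for illustration.

```python
import bisect


class BlockList:
    """Sorted keys stored in consecutive blocks of at most B keys each."""

    def __init__(self, B):
        self.B = B
        self.blocks = [[]]

    def insert(self, key):
        # Walk to the first block whose largest key is >= key (or the last block).
        i = 0
        while i < len(self.blocks) - 1 and self.blocks[i][-1] < key:
            i += 1
        bisect.insort(self.blocks[i], key)
        if len(self.blocks[i]) > self.B:
            # Overflow: split into two blocks of roughly B/2 keys each.
            blk = self.blocks[i]
            mid = len(blk) // 2
            self.blocks[i:i + 1] = [blk[:mid], blk[mid:]]

    def smallest(self, Z):
        """Return the Z smallest keys and the number of blocks read."""
        out, touched = [], 0
        for blk in self.blocks:
            if len(out) >= Z:
                break
            touched += 1
            out.extend(blk[:Z - len(out)])
        return out, touched
```

Because every block produced by a split holds at least $B/2$ keys, reading the $Z$ smallest keys touches roughly $1 + 2Z/B$ blocks, which mirrors the $O(1 + Z/B)$ bound of the lemma.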

Now, for each integer $x_0 \in \{1, \ldots, U\}$, apply Lemma 1 on all the points whose x-coordinates are
$x_0$: regard each point $(x_0, y, z)$ as an element with sorting key $z$, inserted at time $y$ and never deleted.
We thus obtain $U$ persistent sorted lists. Based on them, a query $(a, b, \tau)$ can be answered
in $O(U + K/B)$ I/Os: for every persistent sorted list whose points have x-coordinates in $[a, b]$, first pick
the version at time $a$, then report all the elements whose keys are below $\tau$.

It remains to replace the term $U$ with $\log_B |S|$ in the query cost. To achieve this, we adopt
the idea of the external memory priority search tree [3] to build some "top" structures. For each
$x_0, y_0 \in \{1, \ldots, U\}$, let

$$S[x_0, < y_0] = \{(x, y, z) \in S \mid x = x_0,\ y < y_0\}$$

and

$$S[< y_0] = \bigcup_{x \ge y_0} \text{top-}B(S[x, < y_0]),$$

where top-$Z(Q)$ represents the $Z$ points in $Q$ with the minimum z-coordinates. In other words, $S[< y_0]$
picks the top-$B$ elements from each relevant persistent sorted list at version $y_0$. Then, we can answer
the query $(a, b, \tau)$ against $S[< a]$ first. If fewer than $B$ points are reported for a specific $x_0$, we know that
all resulting elements in that persistent sorted list have been reported; thus, we can skip it. If, on the
other hand, at least $B$ points are reported, we can afford querying the persistent sorted list since we
will report at least $B$ points from it. We have thus eliminated the term $U$, but introduced a new term
due to the search of $S[< a]$.
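The filter-then-drill rule can be checked with a brute-force sketch. It assumes the query reports points with $y < a \le x \le b$ and $z < \tau$ (the boundary conventions are an assumption), simulates a version by filtering on $y$, and uses a loop over $x$ in place of the priority search tree; `version` and `small_grid_query` are hypothetical names.

```python
from collections import defaultdict


def version(by_x, x0, y0):
    """Version y0 of list x0: its points with y < y0, sorted by key z."""
    return sorted((p for p in by_x[x0] if p[1] < y0), key=lambda p: p[2])


def small_grid_query(points, U, B, a, b, tau):
    """Report points (x, y, z) with y < a <= x <= b and z < tau."""
    by_x = defaultdict(list)
    for p in points:
        by_x[p[0]].append(p)
    result, lists_opened = [], 0
    for x in range(a, min(b, U) + 1):
        top = version(by_x, x, a)[:B]       # this list's share of the "top" set
        hits = [p for p in top if p[2] < tau]
        if len(hits) < B:
            result.extend(hits)             # nothing else in this list qualifies
        else:
            lists_opened += 1               # at least B outputs pay for the scan
            result.extend(p for p in version(by_x, x, a) if p[2] < tau)
    return result, lists_opened
```

Each list that is opened contributes at least $B$ reported points, so the number of opened lists is at most $K/B$; this is what removes the dependence on $U$.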

To efficiently retrieve the points in $S[< a]$ that qualify, we know that each point $(x, y, z) \in S[< a]$
already satisfies $y < a \le x$. So we only need to issue a quadrant query to find all the points satisfying
$x \le b$ and $z < \tau$. We use the external memory priority search tree, whose query cost is $O(\log_B |S[< a]| + Z/B)$ if $Z$ points qualify. Therefore, the overall query cost is $O(\log_B |S| + K/B)$.

To analyze the space consumption, observe that no two persistent sorted lists share the same point.
Thus, the persistent sorted lists consume $O(|S|)$ space. Then, for each $y_0 \in \{1, \ldots, U\}$, an external
memory priority search tree is built, which takes $O(UB)$ words. Overall, the structure occupies $O(|S| + U^2 B)$ words.

5.2.2 A General Structure This subsection bootstraps the data structure proposed in the previous
subsection to solve Problem 1 in the general setting. Consider the following divide-and-conquer scheme.
Let $N' = N^{2/3} B^{1/3}$. Partition the point set $S$ into $S_1, S_2, \ldots, S_L$ according to the x-coordinates, such
that (i) each set $S_i$ contains $N'/2$ to $N'$ points for $i = 1, \ldots, L$, and (ii) all points in $S_i$ have smaller
x-coordinates than any point in $S_{i+1}$ for $i = 1, \ldots, L-1$. Denote by $\delta_i$ ($i \in [1, L]$) the minimum
x-coordinate in $S_i$.† Therefore, $\delta_2, \delta_3, \ldots, \delta_L$ partition the 3d space into $L$ vertical "slabs". We further
partition the slabs according to the planes $y = \delta_2, y = \delta_3, \ldots, y = \delta_L$. As a result, the whole space is
partitioned into $O(L^2)$ cells. Figure 2(a) illustrates the xy-projection of the partition.

Fig. 2. Divide-and-conquer scheme in EM, panels (a), (b) and (c). Only x- and y-coordinates are illustrated.

Now, given a query $Q = (a, b, \tau)$, we decompose it into five subqueries $Q_1, Q_2, \ldots, Q_5$ as illustrated in Figure 2(b).
Formally, let $i, j$ be integers that satisfy $\delta_i \le a < \delta_{i+1}$ and $\delta_j \le b < \delta_{j+1}$. Then, the subqueries are
defined as follows.

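- $Q_1$ is the part of $Q$ whose xy-projection falls in $[\delta_i, \delta_{i+1}) \times [\delta_i, \delta_{i+1})$.

- $Q_2$ is the part of $Q$ whose xy-projection falls in $[\delta_i, \delta_{i+1}) \times (-\infty, \delta_i)$.

- $Q_3$ is the part of $Q$ whose xy-projection falls in $[\delta_{i+1}, \delta_j) \times [\delta_i, \delta_{i+1})$.

- $Q_4$ is the part of $Q$ whose xy-projection falls in $[\delta_j, \delta_{j+1}) \times (-\infty, \delta_{i+1})$.

- $Q_5$ is the part of $Q$ whose xy-projection falls in $[\delta_{i+1}, \delta_j) \times (-\infty, \delta_i)$.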

It is also possible that $i = j$. This degenerate case is shown in Figure 2(c), in which only $Q_1$ and
$Q_2$ may contain result points. Either way, in general, subqueries $Q_2$, $Q_3$ and $Q_4$ can be answered
with either 2d 3-sided range queries [3] or 3d dominance search [1] (recall that the z-coordinates are
not illustrated here). Therefore, they can all be answered in $O(\log_B N + Z/B)$ I/Os if $Z$ points are
retrieved. Furthermore, $Q_1$ is a subproblem with exactly the same definition as Problem 1 but with
problem size $N'$ instead of $N$.

† For notational convenience, define $\delta_{L+1} = +\infty$.

It remains to consider $Q_5$. A crucial observation is that all three of its boundaries (with respect to the x- and
y-coordinates) are on the partition planes. Furthermore, the left boundary (i.e., plane $x = \delta_{i+1}$) is always
the successor of the top boundary (i.e., plane $y = \delta_i$). Therefore, we can reassign the x- and y-coordinates
of the points such that $Q_5$ becomes a small-grid problem, which has been solved in Section 5.2.1. The
reassignment works as follows. Given a point $(x, y, z)$, reassign $x$ to integer $k$ if $\delta_k \le x < \delta_{k+1}$; reassign
$y$ to integer $k + 1$ if $\delta_k \le y < \delta_{k+1}$. Therefore, $Q_5$ can also be answered in $O(\log_B N + Z/B)$ I/Os. As
no point is reported twice, by ignoring the necessary $O(K/B)$-I/O term to retrieve $O(K)$ documents,
the remaining part of the query cost is given by the following equation.

$$T(N) = T(N^{2/3} B^{1/3}) + O(1 + \log_B N). \qquad (1)$$

The recursion terminates with $T(N) = O(1)$ for $N = O(B)$.
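The coordinate reassignment used for $Q_5$ can be sketched with a binary search over the slab boundaries. The function name `reassign` is hypothetical. The sketch assumes a query convention of $y' < a' \le x' \le b'$ on the reassigned grid with $a' = i + 1$ and $b' = j - 1$, and maps any $y$ below $\delta_1$ to 1.

```python
import bisect


def reassign(point, deltas):
    """Map (x, y, z) to grid coordinates; deltas = [delta_1, ..., delta_L], sorted.

    Slab k (1-based) covers [delta_k, delta_{k+1}), with delta_{L+1} = +infinity.
    """
    x, y, z = point
    kx = bisect.bisect_right(deltas, x)  # k with delta_k <= x < delta_{k+1}
    ky = bisect.bisect_right(deltas, y)  # 0 if y lies below delta_1
    return (kx, ky + 1, z)
```

For example, with boundaries `[0, 10, 20, 30, 40]`, $i = 2$ and $j = 5$, the region $x \in [20, 40)$, $y < 10$ becomes $x' \in \{3, 4\}$, $y' < 3$, which is a small-grid query with $a' = 3$ and $b' = 4$.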

From now on, we use $N_0$ to denote the initial problem size, and $N$ to denote the current problem
size (due to the recursion). To solve (1),

$$\begin{aligned}
T(N_0) &= T((N_0/B) \cdot B) \\
&= T((N_0/B)^{2/3} \cdot B) + O(1 + \log_B(N_0/B)) \\
&= T((N_0/B)^{(2/3)^2} \cdot B) + O(1 + \log_B(N_0/B)) + O(1 + \log_B (N_0/B)^{2/3}) \\
&= O\Big(\sum_{k=0,1,\ldots} \big(1 + (2/3)^k \log_B(N_0/B)\big)\Big) \\
&= O(\log\log N_0 + \log_B(N_0/B)),
\end{aligned}$$

where the term $\log\log N_0$ comes from the fact that the recursion level is upper bounded by $O(\log\log N_0)$.
Let $h = \log_B N_0$. Then,

$$\log\log N_0 + \log_B N_0 = \log\log(B^h) + h = \log h + \log\log B + h = O(\log\log B + h).$$

We have thus proved the desired query bound. 
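As a numerical sanity check of the recurrence, the following sketch unrolls $N \to N^{2/3} B^{1/3}$ and accumulates the per-level cost $1 + \log_B N$. The stopping rule $N \le 2B$ is an assumed concrete choice for the base case $N = O(B)$, and `recursion_cost` is a hypothetical name.

```python
import math


def recursion_cost(N0, B):
    """Return (levels, total), where total sums 1 + log_B(N) over the recursion."""
    N, levels, total = float(N0), 0, 0.0
    while N > 2 * B:
        total += 1 + math.log(N, B)
        N = N ** (2 / 3) * B ** (1 / 3)
        levels += 1
    return levels, total
```

For $N_0 = 2^{40}$ and $B = 2^{10}$ the recursion has 9 levels, and the total stays below $2 \cdot \text{levels} + 3 \log_B(N_0/B)$, in line with the geometric sum in the derivation.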

Space Consumption In the divide and conquer, observe that the point sets used in answering subqueries
$Q_2, \ldots, Q_5$ are totally disjoint from the subproblem of $Q_1$. In other words, there is a "non-duplication"
property here: once a point appears in a structure to answer any of the subqueries $Q_2, \ldots, Q_5$, it does not
appear in any recursive subproblem. Therefore, by traversing the recursion tree and counting the space
consumption of the structures for these subqueries, this part contributes a cost of

$$\sum_{\text{node } v} \big(O(N_v) + O(N_v + U_v^2 B)\big),$$

where $N_v$ is the number of points used in the recursion node $v$, and $U_v$ is the parameter $L$ at $v$. The
summation of $O(N_v)$ is $O(N_0)$. We now compute the summation of $O(U_v^2)$. As $N' = N_v^{2/3} B^{1/3}$ at
node $v$, parameter $L \le 2N_v/N' \le 2(N_v/B)^{1/3}$. Therefore, at the root level, we have one term with
$U_v^2 = 4(N_0/B)^{2/3}$; one level down, we have at most $2N_0/N_1$ terms with $U_v^2 \le 4(N_1/B)^{2/3}$, where
$N_1/B = (N_0/B)^{2/3}$; in general, at level $l$ (level 0 is the root level), we have at most $2N_0/N_l$ terms with
$U_v^2 \le 4(N_l/B)^{2/3}$, where $N_l/B = (N_0/B)^{(2/3)^l}$. Therefore, the summation of $U_v^2$ at level $l$ is at most

$$(2N_0/N_l) \cdot 4(N_l/B)^{2/3} = 8(N_0/B)(B/N_l)^{1/3} = 8(N_0/B)(B/N_0)^{(2/3)^l/3}.$$

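Notice that the factor $(B/N_0)^{(2/3)^l/3} = (B/N_l)^{1/3}$ is at most 1 and, since $N_l/B$ grows doubly
exponentially when moving from the deepest level toward the root, decays doubly exponentially in that
direction. Therefore, the summation over $l$ is dominated by the deepest levels and is $O(N_0/B)$, meaning
that $\sum_v O(U_v^2 B) = O((N_0/B) \cdot B) = O(N_0)$.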
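The per-level bound can be evaluated numerically. The sketch below assumes, for concreteness, that the recursion stops once $N \le 2B$, and computes $(2N_0/N_l) \cdot 4(N_l/B)^{2/3} \cdot B$ for every level; `grid_space_bound` is a hypothetical name.

```python
def grid_space_bound(N0, B):
    """Per-level upper bounds on the total U^2 * B space of the small-grid parts."""
    N, per_level = float(N0), []
    while N > 2 * B:
        per_level.append((2 * N0 / N) * 4 * (N / B) ** (2 / 3) * B)
        N = N ** (2 / 3) * B ** (1 / 3)
    return per_level
```

Summing the returned values for, say, $N_0 = 2^{40}$ and $B = 2^{10}$ gives a total within a small constant factor of $N_0$, consistent with the linear space bound.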

Combining with the string B-tree [7], we get the following result.

Theorem 2 There exists a linear space data structure taking $O(N)$ words of space, which answers
top-$K$ (unsorted) document retrieval queries in $O(|P|/B + \log_B N + \log\log B + K/B)$ I/Os. □



5.3 I/O-Optimal, Almost Linear Space Data Structure 

Clearly the data structure in sec 5 is optimal for $K = \Omega(B \log\log B)$. The case when $K < B \log\log B$
can be handled separately based on the result in Lemma 5 as follows: we maintain the candidate links
as a list in the sorted order of score. Then the top-$K$ documents ($K \le g$, the grouping factor) can be
retrieved in $O(1 + g/B)$ I/Os in the sorted order of score by scanning all candidate links and reporting
those which satisfy the origin-target conditions with respect to $u_P$. Note that the query time is optimal
when $K = \Theta(g)$, and the space requirement is $O(N)$ words. Therefore, by maintaining the above
described linear space structure for $g = B, 2B, 4B, \ldots, B \log\log B$, the top-$K$ query corresponding to
any $K < B \log\log B$ can be answered optimally in $O(1 + K/B)$ I/Os by querying the structure
corresponding to $K \le g < 2K$.
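Choosing which of the maintained structures to query amounts to rounding $K$ up to the next grouping factor of the form $B \cdot 2^t$. A minimal sketch follows, with the hypothetical helper name `pick_g`; for $K < B$ it simply returns $B$, which still costs $O(1)$ I/Os.

```python
def pick_g(K, B):
    """Smallest g in {B, 2B, 4B, ...} with g >= K."""
    g = B
    while g < K:
        g *= 2
    return g
```

For $K \ge B$ the returned value satisfies $K \le g < 2K$, so scanning the $g$ candidate links costs $O(1 + g/B) = O(1 + K/B)$ I/Os.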

Theorem 3 There exists an almost-linear space data structure of $O(N \log\log\log B)$-word space, which
answers top-$K$ (unsorted) document retrieval queries in optimal $O(|P|/B + \log_B N + K/B)$ I/Os. □



6 Cache-Oblivious Data Structures 

Our cache-oblivious results are derived using the internal memory framework of sec 3 and 4, where 
all internal memory data structures are replaced by the best known corresponding cache-oblivious 
counterparts, and we get the following result. 

Theorem 4 There exist cache-oblivious data structures for the top-$K$ (unsorted) document retrieval
problem with the following space-I/O trade-offs.

- $O(N \sqrt{\log N} \log\log N)$-word space and $O(\log N \log_B N + K/B)$ I/Os.

- $O(N \sqrt{\log N} \log^2\log N)$-word space and $O(\log\log N \log_B N + K/B)$ I/Os.

Proof (sketch). The top-$K$ document retrieval problem can be converted to its threshold version via the
logarithmic sketch (sec 5.1). Using the framework in sec 3, the original problem can be decomposed into
$O(\log N)$ three-sided queries and a 3-d dominance query (general case of stabbing intervals with score).
By substituting an $O(N \sqrt{\log N} \log\log N)$-word space and $O(\log_B N + K/B)$-I/O data structure [2]
for these sub-problems, the first result can be obtained. As the input range is the same for all $O(\log N)$
three-sided queries, the structures can be combined‡ to reduce the number of three-sided queries to
$O(\log\log N)$ with an $O(\log\log N)$ blowup in space. This leads to the second result. □

Theorem 5 There exists a linear space data structure of $O(N)$-word space, which answers document
listing queries in $O(\log N + ndoc/B)$ I/Os, where $ndoc$ is the number of documents containing $P$.

Proof (sketch). The document listing problem (i.e., without score constraint) can be reduced to scanning
and reporting of elements from $O(\log N)$ lists, hence can be served in $O(\log N + ndoc/B)$ I/Os. □



‡ Consider a balanced binary tree of $\log N$ leaves, thus of $O(\log\log N)$ height. The three-sided range reporting structure over the
links corresponding to c-depth(target) $= i$ is stored at leaf $i$. An internal node $u$ maintains a combined three-sided range
reporting structure of all those links with c-depth(target) $= j$, given that the $j$th leftmost leaf is in the subtree of $u$. Here
each link is a part of $O(\log\log N)$ structures, hence the space will blow up by an $O(\log\log N)$ factor. However, the
number of structures to be searched is reduced to $O(\log\log N)$.
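The reduction in the number of searched structures can be illustrated with a standard canonical decomposition over a balanced binary tree. The sketch assumes that a query needs a contiguous range of leaves `lo..hi` (an assumption, since the text does not spell out the range) and identifies each tree node by the interval of leaves below it; `canonical_nodes` is a hypothetical name.

```python
def canonical_nodes(lo, hi, node_lo, node_hi):
    """Maximal tree nodes, given as leaf intervals, that exactly cover leaves lo..hi."""
    if hi < node_lo or node_hi < lo:
        return []                        # node is disjoint from the range
    if lo <= node_lo and node_hi <= hi:
        return [(node_lo, node_hi)]      # node is fully covered: use its structure
    mid = (node_lo + node_hi) // 2
    return (canonical_nodes(lo, hi, node_lo, mid)
            + canonical_nodes(lo, hi, mid + 1, node_hi))
```

For a tree with 8 leaves numbered 0 to 7, the range 1..5 is covered by the three nodes `(1, 1)`, `(2, 3)` and `(4, 5)`; in general the number of nodes is at most about twice the tree height.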